gold document
Beyond Static Retrieval: Opportunities and Pitfalls of Iterative Retrieval in GraphRAG
Guo, Kai, Dai, Xinnan, Zeng, Shenglai, Shomer, Harry, Han, Haoyu, Wang, Yu, Tang, Jiliang
Retrieval-augmented generation (RAG) is a powerful paradigm for improving large language models (LLMs) on knowledge-intensive question answering. Graph-based RAG (GraphRAG) leverages entity-relation graphs to support multi-hop reasoning, but most systems still rely on static retrieval. When crucial evidence, especially bridge documents that connect disjoint entities, is absent, reasoning collapses and hallucinations persist. Iterative retrieval, which performs multiple rounds of evidence selection, has emerged as a promising alternative, yet its role within GraphRAG remains poorly understood. We present the first systematic study of iterative retrieval in GraphRAG, analyzing how different strategies interact with graph-based backbones and under what conditions they succeed or fail. Our findings reveal clear opportunities: iteration improves complex multi-hop questions, helps promote bridge documents into leading ranks, and different strategies offer complementary strengths. At the same time, pitfalls remain: naive expansion often introduces noise that reduces precision, gains are limited on single-hop or simple comparison questions, and several bridge evidences still be buried too deep to be effectively used. Together, these results highlight a central bottleneck, namely that GraphRAG's effectiveness depends not only on recall but also on whether bridge evidence is consistently promoted into leading positions where it can support reasoning chains. To address this challenge, we propose Bridge-Guided Dual-Thought-based Retrieval (BDTR), a simple yet effective framework that generates complementary thoughts and leverages reasoning chains to recalibrate rankings and bring bridge evidence into leading positions. BDTR achieves consistent improvements across diverse GraphRAG settings and provides guidance for the design of future GraphRAG systems.
- North America > United States > Texas (0.04)
- North America > United States > Oregon (0.04)
- North America > United States > Michigan (0.04)
- (4 more...)
The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality
Newman, Benjamin, Ravichander, Abhilasha, Jung, Jaehun, Xin, Rui, Ivison, Hamish, Kuznetsov, Yegor, Koh, Pang Wei, Choi, Yejin
Language models are prone to hallucination - generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a models' own beliefs can provide a powerful signal for factuality.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Singapore (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (17 more...)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)
- Government > Voting & Elections (1.00)
- (5 more...)
Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
This paper presents Team Marikarp's solution for the SIGIR 2025 LiveRAG competition. The competition's evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M documents subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG pipeline achieved first place in the competition.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.83)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
Huang, Wenyu, Vougiouklis, Pavlos, Lapata, Mirella, Pan, Jeff Z.
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (2 more...)
CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability
Peng, Han, Jiang, Jinhao, Dong, Zican, Zhao, Wayne Xin, Fang, Lei
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce $\textbf{CAFE}$, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (4 more...)
Likelihood as a Performance Gauge for Retrieval-Augmented Generation
Liu, Tianyu, Qi, Jirui, He, Paul, Bisazza, Arianna, Sachan, Mrinmaya, Cotterell, Ryan
Recent work finds that retrieval-augmented generation with large language models is prone to be influenced by the order of retrieved documents in the context. However, the lack of in-depth analysis limits the use of this phenomenon for prompt engineering in practice. In this study, we posit that likelihoods serve as an effective gauge for language model performance. Through experiments on two question-answering datasets with a variety of state-of-the-art language models, we reveal correlations between answer accuracy and the likelihood of the question at both the corpus level and the instance level. In addition, we find that question likelihood can also indicate the position of the task-relevant information in the context. Based on these findings, we propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance. We demonstrate their effectiveness with experiments. In addition, our likelihood-based methods are efficient, as they only need to compute the likelihood of the input, requiring much fewer language model passes than heuristic prompt engineering methods that require generating responses. Our analysis deepens our understanding of how input prompts affect model performance and provides a promising direction for efficient prompt optimization.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom (0.04)
- Asia > Singapore (0.04)
- (8 more...)
Toward Optimal Search and Retrieval for RAG
Leto, Alexandria, Aguerrebere, Cecilia, Bhati, Ishwar, Willke, Ted, Tepper, Mariano, Vo, Vy Ai
Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs). Two separate systems form the RAG pipeline, the retriever and the reader, and the impact of each on downstream task performance is not well-understood. Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA). We conduct experiments focused on the relationship between retrieval and RAG performance on QA and attributed QA and unveil a number of insights useful to practitioners developing high-performance RAG pipelines. For example, lowering search accuracy has minor implications for RAG performance while potentially increasing retrieval speed and memory efficiency.
- Asia > Singapore (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- (2 more...)
Reducing Distraction in Long-Context Language Models by Focused Learning
Wu, Zijun, Liu, Bingyuan, Yan, Ran, Chen, Lei, Delteil, Thomas
Recent advancements in Large Language Models (LLMs) have significantly enhanced their capacity to process long contexts. However, effectively utilizing this long context remains a challenge due to the issue of distraction, where irrelevant information dominates lengthy contexts, causing LLMs to lose focus on the most relevant segments. To address this, we propose a novel training method that enhances LLMs' ability to discern relevant information through a unique combination of retrieval-based data augmentation and contrastive learning. Specifically, during fine-tuning with long contexts, we employ a retriever to extract the most relevant segments, serving as augmented inputs. We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned. Extensive experiments on long single-document and multi-document QA benchmarks demonstrate the effectiveness of our proposed method.
- North America > Canada > Alberta (0.14)
- Asia > India > Maharashtra > Mumbai (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Research Report (0.82)
- Personal > Honors (0.70)
- Leisure & Entertainment (1.00)
- Media > Film (0.49)
MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
Wang, Yujing, Zhang, Hainan, Pang, Liang, Pang, Liang, Zheng, Hongwei, Zheng, Zhiming
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe user's information needs. However, traditional context-based rewriting has minimal enhancement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that user's needs are also reflected in the gold document, retrieved documents and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Puerto Rico (0.06)
- Europe > France (0.05)
- (9 more...)
Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
Hsieh, Cheng-Yu, Chuang, Yung-Sung, Li, Chun-Liang, Wang, Zifeng, Le, Long T., Kumar, Abhishek, Glass, James, Ratner, Alexander, Lee, Chen-Yu, Krishna, Ranjay, Pfister, Tomas
Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-themiddle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle to LLMs' intrinsic attention bias: LLMs exhibit an U-shaped attention bias where the tokens at the beginning and at the end of its input receive higher attention, regardless Figure 1: (a) Lost-in-the-middle refers to models' U-of their relevance. Second, we mitigate shape RAG performance as the relevant context's (e.g., this positional bias through a calibration a gold document containing the answer to a query) position mechanism, found-in-the-middle, that allows varies within the input; (b) We observe models the model to attend to contexts faithfully according exhibit U-shape attention weights favoring leading and to their relevance, even though when ending contexts, regardless of their actual contents; (c) they are in the middle. Third, we show foundin-the-middle Models do attend to relevant contexts even when placed not only achieves better performance in the middle, but are eventually distracted by leading/ending in locating relevant information within contexts; (d) We propose a calibration mechanism, a long context, but also eventually leads to improved found-in-the-middle, that disentangles the effect retrieval-augmented generation (RAG) of U-shape attention bias and allows models to attend performance across various tasks, outperforming to relevant context regardless their positions.
- Asia > Middle East > Jordan (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)